library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
library(visreg)
## Warning: package 'visreg' was built under R version 3.6.2
library(mgcv)
library(ISLR)
library(splines)
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
wdf <- Wage
# Fitting linear regression model
m <- lm(wage~age+year+education+jobclass,wdf)
ggplot(wdf, aes(age, wage)) +
geom_point() + ggtitle("Scatter plot for Age and Wage")+
geom_smooth(method = "lm", formula = y ~x)+geom_smooth(color="red")
ggplot(wdf, aes(year,wage)) +
geom_point() + ggtitle("Scatter plot for Wage and Year")+
geom_smooth(method = "lm", formula = y ~x)+geom_smooth(color="red")
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.
visreg(m, "education",gg=TRUE)
visreg(m, "jobclass", gg=TRUE)
plot(m,1)
Age shows a non-linear relationship with wage, while year shows a roughly linear pattern. For education and job class, wage differs systematically across the categories. However, when we fit a model with all predictors, the residuals-vs-fitted plot shows evidence of slightly non-linear behaviour, driven by the effect of age.
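One quick way to check this visual impression formally is to compare the linear fit against one that adds a quadratic term in age. This is a sketch, not part of the original analysis:

```r
# Hypothetical check: does a quadratic age term improve on the linear fit?
library(ISLR)                      # for the Wage data
wdf <- Wage
m_lin  <- lm(wage ~ age + year + education + jobclass, data = wdf)
m_quad <- lm(wage ~ poly(age, 2) + year + education + jobclass, data = wdf)
anova(m_lin, m_quad)               # a small p-value favours the quadratic term
```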
m1 <- gam(wage ~ age+year+education+jobclass,data=wdf)
summary(m1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## wage ~ age + year + education + jobclass
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.089e+03 6.483e+02 -3.222 0.001286 **
## age 5.485e-01 5.719e-02 9.591 < 2e-16 ***
## year 1.071e+00 3.233e-01 3.314 0.000931 ***
## education2. HS Grad 1.116e+01 2.473e+00 4.511 6.71e-06 ***
## education3. Some College 2.339e+01 2.614e+00 8.947 < 2e-16 ***
## education4. College Grad 3.834e+01 2.616e+00 14.655 < 2e-16 ***
## education5. Advanced Degree 6.275e+01 2.871e+00 21.859 < 2e-16 ***
## jobclass2. Information 4.575e+00 1.379e+00 3.318 0.000917 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.263 Deviance explained = 26.5%
## GCV = 1286.9 Scale est. = 1283.5 n = 3000
# Partial residual plot
visreg(m1,"age")
The partial-residual plot suggests that, within this model, age enters linearly. Under that assumption, the expected wage difference between a 60-year-old and a 20-year-old is the age coefficient times the 40-year gap: 40 × 0.5485 ≈ 21.9.
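That arithmetic can be reproduced in a couple of lines of R, using the age coefficient reported in summary(m1) above:

```r
age_coef   <- 0.5485                  # "age" estimate from summary(m1)
diff_60_20 <- age_coef * (60 - 20)    # linear effect over a 40-year gap
round(diff_60_20, 1)                  # ~21.9 (in wage units)
```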
par(mfrow = c(2,2))
gam.check(m1)
##
## Method: GCV Optimizer: magic
## Model required no smoothing parameter selection
## Model rank = 8 / 8
The QQ-plot and histogram of the residuals show that the residuals are not normally distributed, violating an assumption of linear regression. To improve the model, we can take the log of the dependent variable and refit.
# Using a log transformation to improve the model
m_revised <- gam(log(wage) ~ age+year+education+jobclass,data=wdf)
# Improved m1 model
m2 <- m_revised
par(mfrow = c(2,2))
gam.check(m2)
##
## Method: GCV Optimizer: magic
## Model required no smoothing parameter selection
## Model rank = 8 / 8
It can be observed from the QQ-plot and the residual histogram that taking the log of wage improved the model fit considerably: the residuals now look approximately normal, so the assumptions of the linear model are much better satisfied.
m3<- lm(log(wage) ~ ns(age,20)+ns(year,4)+education +jobclass,data=wdf)
summary(m3)
##
## Call:
## lm(formula = log(wage) ~ ns(age, 20) + ns(year, 4) + education +
## jobclass, data = wdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72890 -0.15628 0.01029 0.16827 1.10332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.91062 0.06151 63.582 < 2e-16 ***
## ns(age, 20)1 0.37329 0.06460 5.779 8.31e-09 ***
## ns(age, 20)2 0.39297 0.08322 4.722 2.44e-06 ***
## ns(age, 20)3 0.34655 0.07505 4.618 4.04e-06 ***
## ns(age, 20)4 0.45854 0.07788 5.888 4.36e-09 ***
## ns(age, 20)5 0.43635 0.08356 5.222 1.89e-07 ***
## ns(age, 20)6 0.52892 0.08252 6.409 1.69e-10 ***
## ns(age, 20)7 0.51121 0.07715 6.626 4.08e-11 ***
## ns(age, 20)8 0.48661 0.07860 6.191 6.81e-10 ***
## ns(age, 20)9 0.54916 0.07768 7.069 1.93e-12 ***
## ns(age, 20)10 0.44376 0.08392 5.288 1.33e-07 ***
## ns(age, 20)11 0.59027 0.07804 7.564 5.19e-14 ***
## ns(age, 20)12 0.44807 0.07593 5.901 4.02e-09 ***
## ns(age, 20)13 0.54939 0.08024 6.847 9.12e-12 ***
## ns(age, 20)14 0.42331 0.07832 5.405 7.00e-08 ***
## ns(age, 20)15 0.55262 0.07745 7.135 1.21e-12 ***
## ns(age, 20)16 0.45405 0.07623 5.957 2.88e-09 ***
## ns(age, 20)17 0.51748 0.07017 7.374 2.13e-13 ***
## ns(age, 20)18 0.38302 0.07573 5.058 4.49e-07 ***
## ns(age, 20)19 0.48202 0.14748 3.268 0.001094 **
## ns(age, 20)20 0.19592 0.10297 1.903 0.057171 .
## ns(year, 4)1 0.09556 0.02890 3.306 0.000957 ***
## ns(year, 4)2 0.04256 0.02464 1.728 0.084177 .
## ns(year, 4)3 0.09242 0.03513 2.630 0.008571 **
## ns(year, 4)4 0.06217 0.02001 3.108 0.001904 **
## education2. HS Grad 0.11076 0.02029 5.459 5.18e-08 ***
## education3. Some College 0.22322 0.02150 10.384 < 2e-16 ***
## education4. College Grad 0.33597 0.02152 15.613 < 2e-16 ***
## education5. Advanced Degree 0.49568 0.02359 21.016 < 2e-16 ***
## jobclass2. Information 0.03482 0.01129 3.083 0.002068 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2924 on 2970 degrees of freedom
## Multiple R-squared: 0.3157, Adjusted R-squared: 0.3091
## F-statistic: 47.26 on 29 and 2970 DF, p-value: < 2.2e-16
# Plot visualization
visreg(m3,"age",gg=T)+ggtitle("Age")
visreg(m3,"year",gg=T)+ggtitle("Year")
The age plot shows that log wages increase until about age 50 and then decline slightly. The year plot shows that wages increase roughly linearly over the years.
# Penalized smoothing spline, smoothness selected by REML
m4 <- gam(log(wage) ~ s(age)+year+education+jobclass,method="REML",data=wdf)
summary(m4)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ s(age) + year + education + jobclass
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.990414 5.305136 -3.768 0.000168 ***
## year 0.012161 0.002645 4.598 4.44e-06 ***
## education2. HS Grad 0.113547 0.020204 5.620 2.09e-08 ***
## education3. Some College 0.227946 0.021375 10.664 < 2e-16 ***
## education4. College Grad 0.338383 0.021429 15.791 < 2e-16 ***
## education5. Advanced Degree 0.498599 0.023516 21.202 < 2e-16 ***
## jobclass2. Information 0.035032 0.011258 3.112 0.001877 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 5.622 6.732 49.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.309 Deviance explained = 31.2%
## -REML = 599.61 Scale est. = 0.085461 n = 3000
anova(m4)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ s(age) + year + education + jobclass
##
## Parametric Terms:
## df F p-value
## year 1 21.142 4.44e-06
## education 4 172.486 < 2e-16
## jobclass 1 9.683 0.00188
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 5.622 6.732 49.03 <2e-16
par(mfrow = c(2,2))
gam.check(m4)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-0.0003521119,0.0002342288]
## (score 599.6133 & scale 0.08546121).
## Hessian positive definite, eigenvalue range [1.884817,1496.004].
## Model rank = 16 / 16
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(age) 9.00 5.62 1 0.46
visreg(m4,"age",gg=T)+ggtitle("Age")
Here the model is fitted with a penalized smoothing spline for age. The diagnostics suggest the assumptions are satisfied, and the partial plot for age shows almost the same shape as in the natural-spline model with fixed knots.
m5 <- gam(log(wage) ~ s(age,by= jobclass)+year+education+jobclass,data=wdf)
summary(m5)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(wage) ~ s(age, by = jobclass) + year + education + jobclass
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.957255 5.304042 -3.763 0.000171 ***
## year 0.012145 0.002644 4.593 4.56e-06 ***
## education2. HS Grad 0.112779 0.020219 5.578 2.65e-08 ***
## education3. Some College 0.227836 0.021387 10.653 < 2e-16 ***
## education4. College Grad 0.338561 0.021452 15.783 < 2e-16 ***
## education5. Advanced Degree 0.499573 0.023529 21.232 < 2e-16 ***
## jobclass2. Information 0.035283 0.011262 3.133 0.001747 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age):jobclass1. Industrial 4.635 5.673 37.62 <2e-16 ***
## s(age):jobclass2. Information 4.340 5.355 21.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.309 Deviance explained = 31.3%
## GCV = 0.085919 Scale est. = 0.085461 n = 3000
# Plot diagnostics
par(mfrow=c(2,2))
gam.check(m5)
##
## Method: GCV Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 4.751158e-07 .
## The Hessian was positive definite.
## Model rank = 25 / 25
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(age):jobclass1. Industrial 9.00 4.63 1 0.42
## s(age):jobclass2. Information 9.00 4.34 1 0.51
# Interaction plot
visreg(m5,"age",by= "jobclass",gg=T, overlay = T)+ggtitle("Interaction of age and jobclass")
The model diagnostics do not suggest any problems. The interaction plots, however, are informative: for industrial jobs, wages increase from age 20 and start to decline around age 50; for information jobs, wages increase until about age 40 and then decline slightly.
To test whether the interaction is needed, we apply anova() to the two models, which provides an F-test and p-value for the comparison.
anova(m5,m4,test = "F")
## Analysis of Deviance Table
##
## Model 1: log(wage) ~ s(age, by = jobclass) + year + education + jobclass
## Model 2: log(wage) ~ s(age) + year + education + jobclass
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2982.0 255.02
## 2 2985.9 255.31 -3.9532 -0.2869 0.8492 0.4928
The p-value is 0.4928, greater than 0.05, so by our criterion we do not reject the null hypothesis: the interaction term is not needed, and the simpler model without it is preferred.
# Without interaction
revised <- gam(log(wage) ~ s(age)+year+education,data=wdf)
# Test for a non-linear age effect
anova(m2,revised,test = "F")
## Analysis of Deviance Table
##
## Model 1: log(wage) ~ age + year + education + jobclass
## Model 2: log(wage) ~ s(age) + year + education
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2992.0 272.46
## 2 2987.6 256.18 4.4034 16.282 43.139 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is extremely small, far below 0.05, so we reject the null hypothesis of a linear age effect: there is clear evidence of non-linearity, and the model with the smooth term is better.
d1 <- data.frame(year=2009,education="2. HS Grad",age=20,jobclass="1. Industrial")
# Predicting Age 20
predict.gam(revised,d1,se=T)
## $fit
## 1
## 4.205112
##
## $se.fit
## 1
## 0.02797235
# Predicting Age 60
d2 <- data.frame(year=2009,education="2. HS Grad",age=60,jobclass="1. Industrial")
# Predicting Age 60
predict.gam(revised,d2,se=T)
## $fit
## 1
## 4.605911
##
## $se.fit
## 1
## 0.01895458
From the predictions, the fitted log(wage) at age 20 is 4.205 and at age 60 is 4.606. Exponentiating removes the log and gives wages on the original scale: exp(4.205) ≈ 67 for age 20 and exp(4.606) ≈ 100 for age 60.
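The back-transformation can be done directly on the predicted values (a sketch using the fitted values reported above; in practice one would exponentiate the output of predict.gam):

```r
# Exponentiate the log-scale predictions to recover wages
p20 <- 4.205112                # predicted log(wage) at age 20, from predict.gam above
p60 <- 4.605911                # predicted log(wage) at age 60
exp(p20)                       # ~67
exp(p60)                       # ~100
exp(p60) - exp(p20)            # predicted wage gap of roughly 33
```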